Financial Data Structure¶

This notebook is based on the first chapter of Marcos Lopez de Prado’s book: "Advances in Financial Machine Learning" - Financial Data Structures.

With regard to data-driven in finance, the important thing is not only what data is available but also how it can be approached. Many advantages given from Programmatic availability of data is now disserminated via application programing interface (APIs) that allow the usage of computer code to do ETL tasks bringing information in financial world. In general, financial data is amorphous, which can come from in many shapes and forms depending on purposes of finance practitioners' use and requirement.

In the table, structured data refers to numerical data types that often come in tabular structures while unstructured data is represented as standard text that often has no structures (headers, paragraphs, etc). When it comes to market data, wich includes all trading activity that takes place in an exchange or trading venue, eg stock data. This is often shown as historical time-series data in the form of rows refered as "bars".

What is "bars"?¶

The first step to approach any further analysis questions is to aggregate data into a convenient and perspicuous format. Bars is chart illustrations represented most basic infomation about changes in value of financial assets, or tendencies of their movements during a percific period.

Time Bar¶

Time bars are derived from regularly sampling information at a fixed time interval at frequencies of a day, an hour, or even every 15 minutes. They typically are open, close, high, low, volume - historical data structure (usually shortened to OHLCV). Although time bars are one of the most popular among practitioners, series extracted from time-sampled processes often show poor statistical properties such as serial correlation, heteroscedasticity, and non-normality of returns (Easley, Lopez de Prado, and O’Hara [2012]).

Several examples will be shown below to illustrate behaviors of mentioned bar types. Firstly, historical data of S&P500 including open, close, high, low and volume are collected by Yahoo Finance API - yfinance from a 30-day period at a minute interval.

In [1]:
import datetime
import pandas as pd
import numpy as np
import algo_structure as al
import loading as load
from algo_structure import AlgoStructure
from importlib import reload
reload(al)
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import datetime

%matplotlib inline 
import warnings
warnings.filterwarnings("ignore")
# import plotly.io as pio

# pio.renderers.default = "svg"
In [2]:
# Initial time set-up 
end_time_date = datetime.date.today() - datetime.timedelta(days=1)
end_time= datetime.datetime.fromordinal(end_time_date.toordinal())
start_time = end_time - datetime.timedelta(days=27)
# Download stock information
df,stock_df = load.scraping_stock("MSFT",start_time, end_time ,"1m")
0  - Date : 2022-10-17 00:00:00
[*********************100%***********************]  1 of 1 completed
1  - Date : 2022-10-23 00:00:00
[*********************100%***********************]  1 of 1 completed
2  - Date : 2022-10-29 00:00:00
[*********************100%***********************]  1 of 1 completed
3  - Date : 2022-11-04 00:00:00
[*********************100%***********************]  1 of 1 completed
In [3]:
# Save information and reload it for the next time

# df.to_csv('stock_raw.csv')
# stock_df.to_csv('stock_data.csv')

# df = read_csv('Data-Structures/stock_raw.csv')
# stock_df = read_csv('Data-Structures/stock_raw.csv')
In [4]:
df.head()
Out[4]:
Datetime Open High Low Close Adj Close Volume
0 2022-11-04 09:30:00-04:00 217.550003 218.479996 217.160004 217.764999 217.764999 1607523
1 2022-11-04 09:31:00-04:00 217.759995 217.910004 217.440002 217.490005 217.490005 166143
2 2022-11-04 09:33:00-04:00 217.464996 217.490005 216.699997 216.789993 216.789993 509390
3 2022-11-04 09:34:00-04:00 216.429993 216.429993 216.399994 216.399994 216.399994 124727
4 2022-11-04 09:35:00-04:00 216.089996 216.110001 216.041000 216.050003 216.050003 139943
In [5]:
stock_df.head()
Out[5]:
Timestamp Open High Low Close Adj Close Volume
0 2022-11-04 09:30:00 217.550003 218.479996 217.160004 217.764999 217.764999 1607523
1 2022-11-04 09:31:00 217.759995 217.910004 217.440002 217.490005 217.490005 166143
2 2022-11-04 09:33:00 217.464996 217.490005 216.699997 216.789993 216.789993 509390
3 2022-11-04 09:34:00 216.429993 216.429993 216.399994 216.399994 216.399994 124727
4 2022-11-04 09:35:00 216.089996 216.110001 216.041000 216.050003 216.050003 139943
In [6]:
stock_df.columns
Out[6]:
Index(['Timestamp', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')

Volume weighted average price is the average price of a stock weighted by the total trading volumne. This techninal analysis indicator is often used to calculate the price average of a stock over a period of time. To deeply understand what VWAP is, please check this definition. The plot below displays Time bar of S&P500 VWAP. We initially use the volume weighted average price (VWAP) for chart illustrations.

In [7]:
# stock_grp = stock_df.groupby(pd.Grouper(freq = '3Min'))
data_time_vwap = AlgoStructure.vwap_cal(stock_df)
data_vwap_3min = data_time_vwap.groupby(pd.Grouper(freq = '3Min')).mean()
data_vwap_3min = data_vwap_3min.reset_index(drop=False)

# for i in range(0, len(data_time_vwap), 3):
#     data_vwap_3min = data_vwap_3min.append(data_time_vwap.iloc[[i]])

data_vwap_3min.head()
Out[7]:
Timestamp index Open High Low Close Adj Close Typical_price
0 2022-11-04 09:30:00 0.5 217.654999 218.195000 217.300003 217.627502 217.627502 217.707502
1 2022-11-04 09:33:00 3.0 216.661662 216.676666 216.380330 216.413330 216.413330 216.490109
2 2022-11-04 09:36:00 6.0 215.647135 215.674998 215.478200 215.508336 215.508336 215.553845
3 2022-11-04 09:39:00 9.0 214.916667 215.240000 214.688599 214.973338 214.973338 214.967312
4 2022-11-04 09:42:00 12.0 214.416667 214.586466 214.196665 214.263331 214.263331 214.348821
In [8]:
# data_time_vwap.reset_index(drop = False, inplace=True)
fig = make_subplots(specs=[[{"secondary_y": True}]])
# df_plot = data_time_vwap.copy()
colors = {'A':'steelblue',
          'B':'firebrick'}
stock_df.reset_index(drop=False, inplace=True)
fig.add_trace(go.Scatter(x=stock_df['Timestamp'],y=stock_df['Close'],name='Vwap',marker_color=colors['A']),secondary_y=False)
# Add trade volume each timestamp
fig.add_trace(go.Bar(x=stock_df['Timestamp'],y=stock_df['Volume'],name='Volume',marker_color=colors['B']),secondary_y=True)

fig.update_xaxes(
        rangeslider_visible=True,
        rangebreaks=[
            dict(bounds=["sat", "mon"]),  # hide weekends, eg. hide sat to before mon
            dict(bounds=[16, 9.5], pattern="hour"),  # hide hours outside of 9.30am-4pm
            dict(bounds=["2022-10-12 16:00:00","2022-10-13 9:30:00"]),
            dict(bounds = ["2022-10-15 16:0:00", "2022-10-17 11:42:00"])
        ]
    )
# fig.update_traces(color='red')
fig.show()

However, it is the fact that infomation from markets is not exposed uniformly at random, which means it does not follow uniform distribution over time. Therefore, timebars over-sample low activity periods and under-sample high activity periods (activities maybe understood as changing in prices).

In [9]:
data_time_vwap.reset_index(drop = False, inplace=True)
abs_price = abs(AlgoStructure.cal_prchanges(data_time_vwap['Vwap']))
abs_price = pd.DataFrame(abs_price)
abs_price.reset_index(drop = False, inplace = True)

fig = px.histogram(abs_price, x="Vwap",
                  title = 'Histogram of absolute price change (%)',
                  labels = {'Vwap':'Absolute price change (%)'},
                  color_discrete_sequence  = [colors['A']])

fig.update_layout(
        width=800,height=350,
        margin=dict(l=30, r=30, t=38, b=5)
            )
fig.update_layout(xaxis_range=[0,0.1])
fig.show()
00.020.040.060.080.10100200300
Histogram of absolute price change (%)Absolute price change (%)count

As we can see, the majority of 5-minute activities show price changes below 0.5%; the largest part of them experiencing slight changes with the peak at around 0.0025%–0.007%. This actually not an effective input source for training any Machine Learning-based algorithm model because of the fact that there is no changes or features to learn.

Statistical methods often rely on assumptions like non-serial correlation, homogeneity of variances or nomality of returns distribution. Therefore, those poor properties confirm that time bar is not a good data format for machine learning algorithms.

In [10]:
time_bars, time_returns = AlgoStructure.cal_returns(data_time_vwap['Vwap'].astype(float))
time_bars = pd.DataFrame(time_bars)
time_bars.reset_index(drop = False, inplace = True)

fig = px.histogram(time_bars, x="Vwap",
                  title = 'Histogram of log-returns',
                  labels = {'Vwap':'Log-returns'},
                  color_discrete_sequence  = [colors['A']],
                  nbins = 200)
fig.update_layout(
        width=850,height=300,
        margin=dict(l=5, r=5, t=38, b=5)
            )
fig.update_layout(xaxis_range=[-0.0002,0.0004])

fig.show()
−200μ−100μ0100μ200μ300μ400μ0200400600800
Histogram of log-returnsLog-returnscount

It is not really normal but left-skewed.

What are the issues?

Macros Lopez de Prado also points out in the book that Time samples data has weak properties:

  • Oversampling information from low-activity periods, but undersampling information from high-activity periods.
  • Poor statistical properties:

    • Serial correlation (lag) which has high correlation of itself in a delayed timestamp

    • Heteroschedasticity when the timeseries witnessed changes in its variance/errors overtime

    • Finanally, non-normality of returns

Tick Bar¶

Tick bar or candlestick bar is an ubiquitous charting method used in market analysis, especially traders. Candlestick bar chart is a combination of many single candlesticks following a period of time. Two common definitions describing a tick are:

  • A tick is a measure of the minimum upward or downward movement in the price of a security or refer to the change in the price of a security from trade to trade.

  • A tick bar or a tick candle is the aggregation of a predefined number of ticks; in other words, each vertical line on the chart (a tick) represents the high and low for the given trading period. For instance, 100-tick bars can be generated by calculating the Open, High, Low, Close, and Volume values of 100 consecutive actual trades from the exchange.

A candlestick show two main parts:

- A real body: The rectangle shape of the tick, often be called "real body". This part represents the open and close price of a trading period.
If the close is higher than the open, the real body will be white or green, which witnesses increasing prices (the market is bullish); meanwhile if the open is higher than the close,the real body will be black or red, which shows decreasing prices (the market is bearish).

-Upper/Lower shadows: the both shadows are the lines above the real body, and represents the highest/lowest value for the trading period.

In [11]:
tick = data_time_vwap.copy()
# tick.reset_index(drop=False, inplace=True)
stick = AlgoStructure.generate_tickbars(tick,frequency=10)
df_tick = pd.DataFrame(stick, columns = ['Timestamp','Open','High','Low','Close','Volume'])
df_tick['Timestamp'] = df_tick['Timestamp'].apply(lambda t: load.floatotime(t))
df_tick.head()
Out[11]:
Timestamp Open High Low Close Volume
0 2022-11-04 09:40:00 217.764999 217.764999 214.660004 214.660004 10.0
1 2022-11-04 09:50:00 215.190002 215.190002 213.580002 213.940002 10.0
2 2022-11-04 10:00:00 214.104996 217.080002 214.104996 217.080002 10.0
3 2022-11-04 10:10:00 216.979996 217.039993 215.630005 215.865005 10.0
4 2022-11-04 10:20:00 216.330002 217.820007 216.330002 217.820007 10.0
In [12]:
# def bar(x, y):
#     return np.int64(x/y)*y
# trade.groupby(bar(np.cumsum(trade['Volume']), 1000)).agg({'Close': 'ohlc', 'Volume': 'sum'})
In [13]:
fig = go.Figure(data=[go.Candlestick(x=df_tick['Timestamp'],
                open=df['Open'],
                high=df['High'],
                low=df['Low'],
                close=df['Close'])]
                )
fig.update_xaxes(
        rangeslider_visible=True,
        rangebreaks=[
            dict(bounds=["sat", "mon"]),  # hide weekends, eg. hide sat to before mon
            dict(bounds=[16, 9.5], pattern="hour"),  # hide hours outside of 9.30am-4pm
            dict(bounds=["2022-10-12 15:59:00","2022-10-13 9:30:00"]),
            dict(bounds = ["2022-10-15 15:59:00", "2022-10-17 11:42:00"])
        ]
    )
fig.update_layout(
        title='Stock Analysis',
        yaxis_title='Price'
    )
fig.show()
print(f'Number of ticks during the period: {df.shape[0]}')
Number of ticks during the period: 1556
In [14]:
tick_bars, tick_returns = AlgoStructure.cal_returns(df_tick['Close'])
tick_bars = pd.DataFrame(tick_bars)
tick_bars.reset_index(drop = False, inplace = True)

bin_len = 8
fig = px.histogram(tick_bars, x="Close",
                  title = 'Histogram of log-returns',
                  labels = {'Close':'Log-returns'},
                  color_discrete_sequence  = [colors['A']],
                  nbins=len(np.arange(min(df_tick['Close']),
                                    max(df_tick['Close'])+bin_len, bin_len)))

fig.update_layout(
        width=850,height=300,
        margin=dict(l=5, r=30, t=38, b=5)
            )
            
fig.show()
−0.01−0.00500.0050.010.0150510152025
Histogram of log-returnsLog-returnscount
In [15]:
trade = data_time_vwap.copy()
trade_df = AlgoStructure.generate_volumebars(trade,frequency=5)
trade_df = pd.DataFrame(trade_df, columns = ['Timestamp','Open','High','Low','Close','Volume'])
trade_df['Timestamp'] = trade_df['Timestamp'].apply(lambda t: load.floatotime(t))
trade_df.head()
Out[15]:
Timestamp Open High Low Close Volume
0 2022-11-04 09:35:00 217.764999 217.764999 216.050003 216.050003 2547726.0
1 2022-11-04 09:40:00 215.970001 215.970001 214.660004 214.660004 581630.0
2 2022-11-04 09:45:00 215.190002 215.190002 213.580002 213.938004 875706.0
3 2022-11-04 09:50:00 213.990005 214.139999 213.850006 213.940002 873616.0
4 2022-11-04 09:55:00 214.104996 215.151093 214.104996 214.910004 714178.0
In [16]:
trade_bars, volume_returns = AlgoStructure.cal_returns(trade_df['Close'])
trade_bars = pd.DataFrame(trade_bars)
trade_bars.reset_index(drop = False, inplace = True)

bin_len = 5
fig = px.histogram(trade_bars, x="Close",
                  title = 'Histogram of log-returns',
                  labels = {'Close':'Log-returns'},
                  color_discrete_sequence  = [colors['A']],
                  nbins=len(np.arange(min(df_tick['Close']),
                                    max(df_tick['Close'])+bin_len, bin_len)))
fig.update_layout(
        width=850,height=300,
        margin=dict(l=5, r=30, t=38, b=5)
            )
fig.show()

Statistical Properties¶

In [17]:
bar_returns = [time_returns,tick_returns,volume_returns]
In [18]:
bar_types = ['time','tick','volume']
autocorrs = AlgoStructure.get_test_stats(bar_types,bar_returns,pd.Series.autocorr)
autocorrs.sort_values('autocorr_stat')
Out[18]:
sample_size autocorr_stat
tick 155.0 -0.192827
volume 311.0 -0.044798
time 1555.0 0.578814
In [19]:
AlgoStructure.create_corr_plot('Time bars',time_returns)
AlgoStructure.create_corr_plot('Tick bars',tick_returns)
AlgoStructure.create_corr_plot('Volume bars',volume_returns)